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^ , Biomedical studies have a common interest in assessing relationships between multiple related health outcomes 
^ I and high-dimensional predictors. For example, in reproductive epidemiology, one may collect pregnancy outcomes 
^ | such as length of gestation and birth weight and predictors such as single nucleotide polymorphisms in multiple 
CN ' candidate genes and environmental exposures. In such settings, there is a need for simple yet flexible methods 
for selecting true predictors of adverse health responses from a high-dimensional set of candidate predictors. To 
address this problem, one may either consider linear regression models for the continuous outcomes or convert 

these outcomes into binary indicators of adverse responses using pre-defined cutoffs. The former strategy has 

1 > ■ 

. the disadvantage of often leading to a poorly fitting model that does not predict risk well, while the latter 

, 

c/3 , approach can be very sensitive to the cutoff choice. As a simple yet flexible alternative, we propose a method for 

\ adverse subpopulation regression (ASPR), which relies on a two component latent class model, with the dominant 

T 7^ | component corresponding to (presumed) healthy individuals and the risk of falling in the minority component 

r^ - ) ' characterized via a logistic regression. The logistic regression model is designed to accommodate high-dimensional 

■ predictors, as occur in studies with a large number of gene by environment interactions, through use of a flexible 

If} • nonparametric multiple shrinkage approach. The Gibbs sampler is developed for posterior computation. The 

^—i . methods are evaluated using simulation studies and applied to a genetic epidemiology study of pregnancy outcomes. 

^ Copyright © 2011 John Wiley & Sons, Ltd. 
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1. Introduction 

Biomedical studies routinely collect multiple quantitative health outcomes and investigate how the risk of having adverse 
values for these outcomes is associated with predictors. The typical approach in such setting is to 1) use multivariate 
normal linear regression in which the mean of the response distribution varies linearly with predictors; 2) first categorize 
the responses based on pre-specified cutoffs and then fit an ordinal logistic regression. The former approach is insufficiently 
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flexible to accommodate settings in which the predictors do not simply shift the response by a fixed amount for all 
individuals, while the latter approach is extremely sensitive to cut-point choices. In this article, we propose a simple 
alternative approach for adverse subpopulation regression (ASPR) relying on a two component mixture model that 
incorporates a logistic regression for the risk of falling into the minority component in the mixture, with the logistic 
regression model accommodating high-dimensional predictors. We focus on the case when researchers are interested in 
dichotomizing the subjects into two classes: healthy and unhealthy group (corresponding to the majority and minority 
of the population); and each component is modeled by the multivariate normal distribution whose mean vector and 
covariance matrix change with latent class membership. This model is purposefully chosen to be simple to facilitate 
analyses and interpretations in settings involving high-dimensional predictors, though generalizations to multiple latent 
classes is straightforward as discussed in Section 6. 

Our approach is motivated by the Healthy Pregnancy, Healthy Baby (HPHB) Study, a prospective cohort study of 
pregnant women residing within Durham County, NC with the goal of identifying environmental, social and genetic factors 
that contribute to racial disparities in birth outcomes [1]. Here we focus on assessing how predictors - a large number 
of maternal candidate gene single nucleotide polymorphisms (SNPs), environmental exposures, and their interactions 
- impact the risk of low values of infant birth weight and gestational age at delivery. Such research questions cannot 
be addressed by the standard linear regression with continuous responses, where one models the predictor effects on 
the response means. The standard approach is instead to dichotomize the quantitative outcomes into binary indicators, 
such as low birth weight (LBW, birth weight<2500g), preterm birth (PTB, gestational age at delivery <37weeks) and 
small for gestational age (SGA, birth weight less than 10th percentile for that gestational age), and then apply logistic 
regression. While such analyses are easily implemented, they rely on pre-defining thresholds with the analysis results 
varying significantly according to the threshold choice [2]. 

We propose an alternative method for adverse subpopulation regression, which relies on a two component latent class 
model [3, 4, 5], with the component weights dependent on predictors via logistic regression. Related approaches are 
considered by Gage [6], Gage et al. [7] and Schwartz et al. [8], but they focused on models with fixed component weights 
and with the means varying with predictors. In addition, our emphasis is on applications involving high-dimensional 
predictors in which maximum likelihood can be expected to have poor performance. 

In such settings, it has become quite common to rely on either Bayesian methods or penalized likelihoods with penalties 
incorporated to favor having many coefficients estimated at or near zero, leading to variable selection and an effectively 
lower dimensional model. In linear regression and generalized linear models, such methods have become standard, with 
the Lasso [9], elastic net [10] and relevance vector machine [11] providing popular examples. The penalized likelihood 
estimators have a Bayesian interpretation in corresponding to the mode of the posterior distribution obtained under 
carefully chosen priors on the coefficients, with the Laplace leading to the Lasso [12] and a i-distribution with low degrees 
of freedom leading to the relevance vector machine. MacLehose and Dunson [13] recently proposed a new class of multiple 
shrinkage priors that allow shrinkage towards not only zero but also other values, leading to improved performance in 
estimating non-zero coefficients. We will consider these and other shrinkage priors in the context of our ASPR model. 

The remainder of the paper is organized as follows. In Section 2, we introduce the proposed Bayesian adverse 
subpopulation regression model, describing both fully Bayes and fast two-stage approaches for inference. Section 3 
provides details of an Markov chain Monte Carlo (MCMC) algorithm. Section 4 presents the simulation results, evaluating 
and comparing the proposed methods with existing methods. In Section 5, we apply the model to pregnancy outcome data. 
The article concludes with a discussion in Section 6. 
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2. Bayesian Adverse Subpopulation Regression 

2.1. Model Formulation 

Suppose we collect the data (y x^) for subject i, i = 1,2,..., n, where y i is an s x 1 vector of outcomes and x, t is ap x 1 
vector of predictors. We make the simplifying assumption that there are two types of individuals, with m = denoting 
healthy individuals and z; = 1 for potentially unhealthy individuals. In addition, we assume for identifiability that the 
unhealthy individuals are in the minority, with the specific constraints and prior information included for identifiability 
discussed in detail in Section 2.2. This is a simplification which is made for ease in interpretation, assessment of risk, 
and scaling to higher dimensions while accommodating the curse of dimensionality that arises. In many cases, such 
a simplification is made in advance of the analysis by taking one or more response variables and defining cutoffs to 
dichotomize the data prior to analysis. However, it is well known that results are quite sensitive to the choice of cutoff 
[14], and hence we prefer allowing z% to be an adverse health status latent variable. By using a Bayesian approach, we can 
fully accommodate uncertainty in imputing Zi and avoid forcing any hard threshold on the observed quantitative traits. 

Denote uji(xi) = Pr(z; = 1 | Xi) as the probability of allocating subject i to the unhealthy population and let u} 2 {xi) = 
1 — LJi(xi). We then express the conditional density of the response y i given predictors Xi as 

2 

f(Vi I xi) =J2"h(*i)K(yi I o h ,n h ) t (l) 

h=l 

where N s (y i | Oh,^h) is the s-dimensional Normal distribution with mean vector Oh and covariance matrix S^. We 
assume that variability in the healthy and unhealthy groups in the quantitative traits is adequately characterized by a 
multivariate normal distribution for the sake of identifiability. Potentially, one could instead use a heavier-tailed distribution 
such as a multivariate t-distribution, though allowing outlying individuals in the two groups may degrade separation of the 
two components. For example, an unhealthy individual may be in the tails of the t-distribution for the healthy component. 
In addition, uj\{xi) in expression (1) depends on predictors Xi through a logistic regression model: 

: r p( v:^ v (2) 

1 + cxp(7 + x\(5) 

where the coefficient vector (3 = , (3 2 , ■ ■ ■ , P P )' characterizes the effect of predictors on the risk of falling in the minority 
subpopulation. Due to the logistic regression form, the exponentiated f3j coefficients can be interpreted as odds ratios. 



2.2. Prior Specification 

For the ASPR model in (l)-(2), identifiability of the unhealthy subpopulation necessarily relies on prior information. 
In the absence of some prior knowledge, the two subgroups would be exchangeable, and we would encounter a label- 
ambiguity problem. Removal of this problem through appropriate priors is one of the advantages of the simple two 
component framework over more complex latent class regression models having unknown numbers of components. The 
most common approach would place restrictions on the means of the components; for example, ordering the components 
in advance by letting 9n < 9 2 i- This approach assumes that low values of the first response variable are adverse, which 
may be reasonable for a given study but is not so in general. Moreover, placing restriction on the means will fail to solve 
the label ambiguity problem if the components are not separated sufficiently. 

Thus, we consider alternative strategies depending on the application. The first is to elicit informative values for 
the means and covariance matrix in the two components from prior empirical knowledge of the typical distribution of 
the responses in healthy and unhealthy groups. In the absence of such extra knowledge, we may fit a mixture of two 
multivariate normals to the data using EM for maximum likelihood estimation, defining the minority component to be 
adverse. This can be done either from historical data or the current data. Then, fix the Oh, S/i at these estimates in the 
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subsequent analysis. This runs the risk of under-estimating uncertainty but has the advantage of simplifying interpretation 
and completely eliminating identifi ability concerns. 
Another alternative is to specify conditionally conjugate prior distributions for 9h, and 7 as follows, 

(0 h , S fc ) ~ NIW a (0 fc ,E fc I 6> ,Vo,Po,So), h=l,2, 

where NIW s (0/j, | 9 , i/j , p , S ) is the Normal-inverse- Wishart distribution proportional to 
|S/ 1 |-( s+ ' ,0+2 )/ 2 cxp{-itr(Soi:^ 1 ) - ^(9 h - e (i ) l Yr 1 (e h - e )}. As weakly informative empirical Bayes priors, 
the hyperparameters O and S are chosen to be the sample means and covariance matrix for all subjects, and we set 
ipo — 1 and po = s + 2 to reduce the prior information. Additionally, we place an informative prior on the intercept in the 
logistic regression model (2) after centering the predictors, 7 ~ Ni(7 | 70, Ao). For example, by choosing 70 = —2.20 
and Ao = 2.42 for the intercept, the expected baseline probability (a priori) of an adverse response is 10% and falls in the 
range between 3% and 30% with 0.95 probability. 

As for the priors for (3, if ccj is low-dimensional, we can rely on standard choices, such as independent Gaussian 
distributions with modest variance. However, as the number of predictors increases, we need some approach for addressing 
the high dimensionality. A common strategy in the frequentist literature is to use sparse penalized regression (e.g., Lasso, 
elastic net, etc) to favor many elements of j3 that are equal to zero while shrinking the non-zero elements toward zero. In the 
Bayesian literature, a rich variety of shrinkage priors have been proposed for high-dimensional regression coefficients, with 
most approaches relying on priors that are centered at zero, potentially with a point mass incorporated to allow variable 
selection. Hierarchical shrinkage priors that are centered at zero can potentially lead to over-shrinkage of coefficients that 
are not close to zero. Such over-shrinkage can be reduced by choosing a prior which is concentrated near zero with very 
heavy tails, but in that case there is no borrowing of information or incorporation of prior knowledge in estimating the 
coefficients that are not close to zero. 

As an alternative approach that had excellent performance in high-dimensional logistic regression, MacLehose and 
Dunson [13] proposed a multiple shrinkage prior (MSP) /?_, ~ J DE(/?j | pj,Tj)dP(p,j,Tj), where DE(/3j | fJ,j,Tj) is the 
double exponential (Laplace) distribution with location parameter pj and scale parameter rf, the mixture distribution P is 
assigned a modified Dirichlet process prior that incorporates a mass at pj = for the first component. In stick-breaking 
form [15], this prior for P(-) can be expressed as 

00 

t=2 

(4 ~ N i(/4 I c,d), 

Ti ~ Gamma(r 1 * | a ,b ), r t * - Gamma(r t * | 01,61), 
n = V t JlO--Vi), V r t ~Beta(l,a), 

Kt 

where Gamma(Y | a, b) = l/[b a T(a)]T a ~ 1 exp(— r/b) with mean a x b; Sg(-) is the probability measure with all its mass 
at 9; following MacLehose and Dunson [13], we choose c = 0, ao = b a = 30, a\ = bi = 6.5 and a = 1. Consequently, 
through MSP, the coefficients (3 will be shrunk to multiple locations p*s, including zero in the first cluster (t = 1), 
corresponding to the usual Bayesian Lasso prior, while the other components are centered at unknown locations away from 
zero. For j3j and/3j< belonging to the same cluster t, (3j ^ (3j> with£ , (/3 J ) = E((3j>) = p^ andVar((3j) = Var((3j>) = r t * 2 . 
In our application of the ASPR model, it is unlikely that any of the predictors being considered have a log-odds ratio of 
falling in the adverse sub-group outside of j3j e [—1, 1], corresponding to an interval of [0.37,2.72] for the odds ratio. 
In most genetic epidemiology studies involving complex health conditions, one expects at most a modest deviation 
from a log-odds ratio at 1.00 for single SNPs or SNP x environment interactions. This small signal-to-noise ratio is 
one aspect that makes detection of important variants so challenging. To express this prior information, while inflating 
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the prior variance somewhat to corresponding to a "weakly informative" prior [16], we let d = 0.1507, which leads to 
Pr (p* t G [-1, 1]) = 0.99 a priori. 



3. Posterior Computation 

In describing an approach for posterior computation, we focus on the approach described in Section 2 that places a normal- 
inverse-Wishart prior on the component-specific parameters, an informative prior on the intercept 7 for identifiability, 
and a mixture of double exponential shrinkage prior on the high-dimensional vector of j3 coefficients. This approach is 
straightforward to modify to accommodate the other approaches described in Section 2. For example, to instead use the 
two-stage plug-in approach, we would run the EM algorithm first to estimate ph, £/i for h = 1, 2 and then would hold 
these component-specific parameters fixed in the proposed data augmentation Gibbs sampling algorithm to be described 
below. In addition, if an alternative shrinkage prior were used for the coefficients f3j , then one could simply modify the 
sampling steps for updating the /3j appropriately. For scale mixture of normal priors, such as double exponentials, t priors 
or other standard choices, this is straightforward. 

If we observe the latent subpopulation index Zi directly for each individual and are interested in the coefficients 
(3, then we could apply the MCMC algorithm of MacLehose and Dunson [13] directly. However, because we do 
not observe z; for any of the subjects, we instead modify their algorithm to include steps for imputing z, from the 
corresponding full Bernoulli conditional posterior distribution and sampling the mean and covariance specific to each 
component. We start by relating the latent subpopulation index z% = I(gi > 0) to an auxiliary random variable g t , 
where /(•) is the indicator function, which equals 1 when gi > and otherwise. To induce expression (2) through 
marginalizing gi, we assume gi follows a logistic distribution centered on 7 + x'fi. Holmes and Held [17] proposed a 
data augmentation MCMC algorithm for posterior computation in logistic regression models relying on characterizing the 
latent g^ as a scale mixture of normals, with the square root of the scale parameters following a Kolmogorov-Smirnov 
(KS) distribution. Due to lack of conjugacy of the conditional posteriors of scale parameters specific to each subject, 
they recommend using rejection sampling. However, use of a large number of rejection sampling steps can lead to 
inefficiencies, so we instead apply an alternative data augmentation scheme. Following O'Brien and Dunson [18], the 
logistic distribution can be almost exactly approximated by a noncentral t-distribution t v (gi | 7 + a;-/3, a 2 ) = Ni(g; | 
7 + x'if3, cr 2 /(/)i)Gamma(</>.; | v/2, 2/v)d(pi, when we set a 2 = ix 2 (y — 2)/2v with degree of freedom v = 7.3. Kinney 
and Dunson [19] showed that posterior distributions of g, estimated with the Holmes and Held [17] and O'Brien and 
Dunson [18] algorithms are essentially completely indistinguishable given sufficient numbers of MCMC samples. We 
outline the Gibbs sampler for Bayesian adverse subpopulation regression in the following steps: 

(a) Draw h and S h from HW s {O h , E h \ § hl $ h , p h , t h ), ft = 1,2, where 

d h = —rVh + — —0o, 

n h + ipo n h + ipo 

4'h =n h + ip , 
Ph = n h + po, 

t h = So + s h + 1 ■ - —r iVh - o )(y h - e y, 

1 + n h ip Q 

with m = £)" =1 1(zi = 1), y x = i Ei:» i= i Vi and s i = T.f.z^iiVi - ViliVi ~ Vi)'< "2 = £"=1 I(zl = 0), y 2 = 

(b) Impute component indicator m from the conditional Bernoulli distribution by setting z% = 1 with probability 

"i(" i )N.(i >< |9 1 ,S 1 ) f • _ j 2 

ELi"'.(* i )N.(i; i |fl h ,E h ) IOT 1 ~ ■ ■ ■ ' n - 
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(c) Augment auxiliary variable g t , i = 1, 2, . . . , n, sampled from the normal distribution Ni(gi | 7 + a 2 /4>i), which 
is truncated above (below) by zero when m = (zi = 1). 

(d) Update fa from Gamma (fa | *±i, ^rr^ppy^ ), i = 1,2, . . . ,n. 

(e) Update the regression coefficients /3* = (7, /3')' given 0j, ^ and other parameters, following the MCMC algorithm of 
MacLehose and Dunson [13]. 

Although we illustrate the algorithm focusing on the multiple shrinkage prior, the above algorithm could be easily modified 
for different shrinkage priors of f3 by using the corresponding sampling algorithm in the step (e). Moreover, we could 
combine the shrinkage and selection method (e.g. Lasso and elastic net) for logistic regression with step (a) and (b) to get 
a Monte Carlo EM algorithm [20] for the adverse subpopulation regression. 

4. Simulations 

In this section, we examine the performance of our approach along with alternative simple two-stage methods through 
simulation studies. In the two-stage methods, the binary indicators for the adverse subpopulation are generated in the first 
stage. They are chosen as the true binary indicators (known for simulation data), the maximum a posteriori allocation 
from a simple two component mixture model estimated via maximum likelihood implemented with the EM algorithm, or 
values obtained using prespecified cutoffs. In the second stage, we fit both the standard logistic regression model without 
penalization and the penalized logistic regression models with the shrinkage methods Lasso and elastic net [21]. 

One hundred datasets were simulated to represent the data observed in the HPHB data set. In particular, we simulated 
813 women with two response variables corresponding to infant birth weight and gestational age at delivery. We used 
maternal genotype for 100 SNPs as predictors that were fixed across the simulations, with only the response variable 
generation varying. By using the real SNP data, we obtained simulated datasets with a realistic dependence structure 
among the predictors, which is important given that the dependence structure can have a fundamental impact on variable 
selection and estimation performance. We simulated data under the model proposed in Section 2.1, with f3 chosen so 
that the first ten elements were set equal to 0.8 (corresponding to odds ratios for the minor allele of exp(0.8) = 2.23) 
and the remaining elements were set equal to zero (corresponding to no association with risk of falling in the adverse 
subpopulation for SNPs 11, ... , 100). We simulated y i based on expression (1), where the Oh and £/j were set equal to the 
maximum likelihood estimates from the HPHB dataset by using a two component latent class model without predictors. 

We applied the proposed ASPR model with default priors specified in Section 2.2 and the two-stage methods to the 
simulated datasets. For ASPR model, we implemented the data augmentation Gibbs sampling algorithm outlined in 
Section 3. The sampling ran for 11,000 iterations, 1,000 iterations were discarded as a burn-in and every 10th sample 
was saved to thin the chain. The trace and autocorrelation plots of the posterior samples were examined to determine the 
convergence. We used the cyclical coordinate descent algorithm by Friedman et al. [21] to find the Lasso or elastic net 
regularization paths for penalized logistic regression models. Ten-fold cross-validations were used to select the optimal 
shrinkage parameter which gave the minimal deviance. 

We first compared the estimation performance measured by mean squared errors (MSEs) which were calculated for 
each coefficient across 100 datasets. Table 1 presents the averaged MSEs obtained across the first ten non-null coefficients 
and the remaining ninety null coefficients. The averaged MSEs of non-null coefficients by ASPR model are smaller than 
those given by the two-stage methods even with true indicators. More importantly, when the true indicators are unknown, 
a common scenario in practice, the two-stage methods rely on indicators generated either by classification algorithm(here 
the maximum a posteriori allocation) or by medical cutoffs will inflate the MSEs of non-null coefficients significantly. 
This observation indicates that if part of subjects are mis-classified in the two-stage methods, the coefficient estimates 
would be affected in the standard and penalized logistic regressions. A better solution is thus to avoid the classification 
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Table 1. Simulation results to compare coefficient estimation and variable selection by the ASPR model, the standard 
logistic regression models without penalization and the penalized standard logistic regression models with shrinkage 

methods. 





ASPR-MSP 


Truth 
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0.274/0.471 


interval, length 










111 UJJ11U.11/ 1> LIU 1 


n QS^/o 43Q 


1 809/3 401 

1 . QUA j.tUl 


1 849/5 651 


1 499/3 983 


TPR 


286 


683 


570 


525 


FPR 


010 


U. IOj 


U. lOO 


144 


AUC 
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0.274/0.005 


0.513/0.004 


0.381/0.003 
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0.790 


0.701 


0.646 
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0.141 


0.114 


0.097 
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0.866 


0.821 


0.794 








Logit-ElasticNet 




MSE 










(Nonnull/Null) 




0.233/0.005 


0.470/0.004 


0.359/0.003 


TPR 




0.905 


0.838 


0.771 


FPR 




0.242 


0.210 


0.175 


AUC 




0.919 


0.876 


0.845 



or using cutoffs in the first place. In addition to demonstrating smaller MSEs, for ASPR model we can easily obtain the 
credible intervals of coefficient based on posterior samples. As shown by Table 1, the averaged lengths of 90% credible 
interval by ASPR model is much narrower than 90% confidence intervals by standard logistic regression models for both 
the non-null and null coefficients. 

We also compared the variable selection ability for different methods. The predictor f3j is selected if its 90% credible 
or confidence interval dos not contain zero for the ASPR model and standard logistic regression model, or if estimate of 
(3j is not equal to the zero for the penalized logistic regression models. The true positive rate (TPR) and false positive 
rate (FPR) in selection are presented in Table 1 . Compared to the other methods, the ASPR model achieves both lower 
TPR and FPR, which can be increased if we use the narrower credible interval. Although the low TPR in ASPR model 
is undesirable, the low FPR may be beneficial for the study in which the false selection of predictors will lead to serious 
consequence. In applications involving thousands to hundreds of thousands of genetic markers, it is critical to control 
the FDR at a low level to avoid producing hundreds of false findings. Based on the substantive knowledge, the predictor 
Pj may also be selected if \/3j\ > e. Figure 1 illustrates the receiver operating characteristic (ROC) curve, which plots the 
TPRs and FPRs as e is varied. It it clear that the ROC curve by ASPR model is closer to the top left corner with higher 
value of area under the curve (AUC, shown in Table 1), which suggests a better trade-off between the TPR and FPR in 
terms of variable selection. 



5. Application to Pregnancy Outcomes 

There is increasing appreciation that interactions between the genetic and environmental factors contribute to adverse 
birth outcomes. In this analysis, we investigated the effects of maternal genotype and their interaction with lead and 
tobacco exposure on adverse birth outcomes in the infant, adjusting for several confounding factors. The dataset included 
813 non-Hispanic black pregnant women who had singleton pregnancy and were less than 28 weeks gestation at the 
time of enrollment in HPHB study. Based on published studies, we focused on 31 candidate genes which are involved 
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Figure 1. Receiver Operating Characteristic (ROC) curves for different methods: by ASPR-MSR by the two-stage methods with true indicators, - ■ ■ by the two-stage 

methods with subjects allocated by maximum a posteriori, — ■ — by the two-stage methods with indicators defined by cutoffs. 



with maladaptive inflammatory regulation, maternal-fetal circulation, stress response, and environmental contaminant 
metabolism. For those candidate genes, we selected 275 haplotype tagging SNPs which effectively capture the genetic 
diversity of these genes. Please see Swamy et al. [22] and Ashley-Koch et al. [23] for further details on genotyping 
approaches. A detailed description of the SNPs and genes used in this analysis can be found in the Web Appendix. For the 
purpose of this analysis, we assumed that the risk for adverse birth outcomes would be associated with minor alleles. The 
value of each SNP was recorded as one if the mother carried the less frequent allele and as zero otherwise. In addition to 
the genetic data, we measured maternal blood levels of lead and cadmium. The interaction of lead and cadmium with the 
SNPs in relation to gestational age and birth weight is an important research question. We also controlled confounding 
factors by including them in the analysis. These confounders are mother's age, recorded as age group 18-20, 21-35 vs 
35+; education, as no college vs some college; insurance, as private vs others; parity, as zero vs others; infant sex, as male 
vs female. 

We fit the ASPR model with default priors and ran the MCMC algorithm for 11,000 iterations with the first 1,000 
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Table 2. Posterior means and quantiles for component parameters in the ASPR model. 
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Figure 2. Scatter plots of birth weight (grams) and gestational age (days) overlaid with (a) posterior mean of allocation weight u>2 with vertical dash line at 259 and horizontal 
dash line at 2500; (b) posterior predictive density of observations (o). 



iterations discarded as burn-in and every 10th remaining draw retained for analysis. The trace plots and the autocorrelation 
plots suggested the algorithm converged fast and mixes well. Table 2 presents the posterior summary for the component 

parameter 9^ = (6hi,6h2)' ar >d S/, = ( ^ hn ^ h12 j phe table indicates that the healthy group in general has longer 

gestational age with higher birth weight, compared to the unhealthy group. In addition, the subjects in the healthy group 
are more homogeneous with the smaller values in the components of E^. Figure 2(a) shows shaded circles at the raw data 
points, with the darkness of the shading being proportional to the estimated posterior probability of allocation to the healthy 
subgroup. Standard cutoffs for defining preterm birth and low birth weight are also shown. Although most of the children 
that are in the preterm and low birth weight bin have small posterior probabilities of allocation to the healthy subgroup, 
there is substantial uncertainty around the boundary region in particular. This uncertainty is taken into the account by the 
ASPR model but not by the other two-stage approaches. Figure 2(b) plots the raw data and also demonstrates the contours 
of posterior predictive density based on the MCMC samples of ASPR model. There seems no systematic discrepancy 
between the observations and the contours of posterior predictive density, suggesting the ASPR model fits the data well. 

Figure 3 shows the posterior means and 90% credible intervals for /3j,j = l,..., 833. Although all the credible intervals 
cover zero, the posterior mean of (3 66 and /3i 82 , given in Table 3, stay clearly away from zero in the main effect (the 
top panel). This result is also supported by Web Figure 1, which plots the histogram of = Yl g =i I > e )/G> the 
posterior probability that jth predictor have an effect based on G samples of fy. When we take e = 0.1, 7T66 stands out 
with the value 0.119 and ni 82 with the value 0.094 while the majority of other posterior probabilities are around 0.05. (3qg 
corresponds to SNP rs3740564 located in gene GRK5 and /3 182 corresponds to SNP rsl0109780 located in gene NATL 
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Table 3. List of SNPs which have largest estimated SNP effects for different methods. 
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Figure 3. Posterior mean and 90% credible interval for coefficients f3 in ASPR model with MSP. The coefficients are iitustrated for main effects and interactions respectively. 



GRK5 is a member of the G protein-coupled receptor kinase family which is involved in regulating the activity levels 
of G protein-coupled receptors. Polymorphisms in GRK5 have been previously linked to risk for heart failure in African 
Americans [24]. The N-acetyltransferase genes (NAT1 and NAT2) are involved in the metabolism of xenobiotics. NAT1 
has been shown to be expressed in early placenta [25]. We also analyzed the data using two-stage methods with the 
penalized logistic regressions and the indicators generated by maximum a posteriori method. The results are presented 
in Table 3. Both penalized logistic regression methods identifies the predictor X 66 (SNP rs3740564 in GRK5) and Xi 96 
(SNP rs7387461 in NAT1). For ASPR and penalized logistic regression with elastic net, additional /3's are interesting but 
not consistent across the three approaches. This may suggest that these are false positive results. 



UJJ www.sim.org 
Prepared using simauth.cls 



Copyright © 201 1 John Wiley & Sons, Ltd. 



Statist. Med. 2011, 00 1-12 



B. Zhu, D. B. Dunson and A. E. Ashley-Koch 



Statistics 
in Medicine 



6. Discussion 

In this article, we propose an adverse subpopulation regression model for investigating the relationship among multiple 
quantitative outcomes and high-dimensional predictors. Unlike the traditional two-stage methods, the proposed method 
does not require dichotomizing the continuous outcomes into binary indicators and thus avoids information loss. Two stage 
methods are outperformed with smaller MSE and higher area under the ROC curve for variable selection as demonstrated 
by the simulation studies. The new model has been applied to examine the effect of gene and environment interaction on 
adverse pregnancy outcomes. The results suggest the gene GRK5 and NAT1 may influence the occurrence of low birth 
weight and preterm delivery. 

Our focus is on defining a simple approach for assessing the impact of high-dimensional predictors on the risk of an 
adverse outcome when data consist of multiple quantitative traits. By using a two component mixture model, we can 
use a binary response logistic regression model, a framework that is very familiar to epidemiologists, to characterize 
non-linear genetic and environment associations with potentially complex multivariate quantitative traits. The proposed 
framework provides a parsimonious alternative to normal linear regression and logistic regressions based on preliminary 
categorization of quantitative traits, and should be able to detect associations that would not be detected with these 
methods. The proposed ASPR framework has purposefully been chosen to be a simple and parsimonious model that 
is easy to interpret and is scalable to high-dimensional predictors. For sake of simplicity and parsimony, we have avoided 
fully nonparametric Bayesian density regression models [26] that allow unknown numbers of latent classes. Although 
generalizations in such directions are conceptually straightforward, for each additional latent class, one introduces 
an additional p regression coefficients and corresponding hyperparameters, and difficult issues in identifiability, label 
switching and computational complexity arise. For our pregnancy outcome application, the ASPR model provide a good 
fit to the data, as illustrated by the posterior predictive density. 

7. Supplementary Materials 

Web Appendix and Web Figure referenced in Sections 5 are available at the Statistics in Medicine website 
http://onlinelibrary.wiley.com/journal/10 . 1002/ (ISSN) 1097-0258. 
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